In [ ]:
%%HTML
<style>
.container { width:100% }
</style>
In this notebook we investigate the influence of rounding and subclassing on linear regression. To begin, we import all the libraries we need.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.linear_model as lm
We will work with artificially generated data. The independent variable X is a numpy array of $\texttt{N}=400$ random numbers drawn from a normal distribution with mean $\mu = 10$ and standard deviation $1$. Since the data is generated from random numbers, we call the method numpy.random.seed so that our results are reproducible.
In [ ]:
np.random.seed(1)            # fix the seed so the results are reproducible
N = 400                      # number of data points
𝜇 = 10                       # mean of the independent variable
X = np.random.randn(N) + 𝜇   # standard-normal samples shifted to mean 𝜇
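As a quick sanity check, the sample mean should be close to $\mu = 10$ and the sample standard deviation close to $1$:
In [ ]:
print(X.mean(), X.std())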
The dependent variable Y is created by adding some noise to the independent variable X. This noise is normally distributed with mean $0$ and standard deviation $0.5$.
In [ ]:
noise = 0.5 * np.random.randn(len(X))
Y = X + noise
We build a linear model for X and Y.
In [ ]:
model = lm.LinearRegression()
In order to use scikit-learn we have to reshape the array X into a matrix: scikit-learn expects the features as a two-dimensional array of shape (number of samples, number of features).
In [ ]:
X = np.reshape(X, (len(X), 1))
We train the model and compute its score.
In [ ]:
M = model.fit(X, Y)
M.score(X, Y)
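The score computed by LinearRegression.score is the coefficient of determination $R^2$, i.e. the fraction of the variance of Y that is explained by the model. Since $Y = X + \varepsilon$ with $\mathrm{Var}(X) \approx 1$ and $\mathrm{Var}(\varepsilon) = 0.25$, we should expect a score of roughly $1/1.25 = 0.8$. As a small sketch, the same number can be computed by hand:
In [ ]:
# coefficient of determination computed manually: 1 - SS_res / SS_tot
Y_pred = M.predict(X)
1 - np.sum((Y - Y_pred)**2) / np.sum((Y - np.mean(Y))**2)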
In order to plot the data together with the linear model, we extract the coefficients.
In [ ]:
ϑ0 = M.intercept_
ϑ1 = M.coef_[0]
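For reference, the fitted regression line is $\hat{y} = \vartheta_0 + \vartheta_1 \cdot x$; we can print the two coefficients:
In [ ]:
print(ϑ0, ϑ1)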
We plot Y versus X together with the linear regression line.
In [ ]:
xMax = np.max(X) + 0.2
xMin = np.min(X) - 0.2
%matplotlib inline
plt.figure(figsize=(15, 10))
sns.set(style='darkgrid')
plt.scatter(X, Y, c='b') # 'b' is blue color
plt.xlabel('X values')
plt.ylabel('true values + noise')
plt.title('Influence of rounding on explained variance')
plt.plot([xMin, xMax], [ϑ0 + ϑ1 * xMin, ϑ0 + ϑ1 * xMax], c='r')
plt.show()
As we want to study the effect of rounding, the values of the independent variable X are rounded: they are transformed to another unit, rounded to the nearest integer, and then transformed back to the original unit. This way we can investigate how the performance of linear regression degrades when the precision of the measurements of the independent variable is low.
In [ ]:
X = np.round(X * 0.8) / 0.8
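Multiplying by $0.8$, rounding, and dividing by $0.8$ again snaps every value to the nearest multiple of $1/0.8 = 1.25$. A small sketch with a few illustrative values (chosen only for demonstration) makes this concrete:
In [ ]:
for x in [9.3, 10.0, 10.7]:
    print(x, '->', np.round(x * 0.8) / 0.8)   # each result is a multiple of 1.25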
We create a new linear model, fit it to the data and compute its score.
In [ ]:
model = lm.LinearRegression()
M = model.fit(X, Y)
M.score(X, Y)
We can see that the performance of the linear model has degraded considerably.
In [ ]:
ϑ0 = M.intercept_
ϑ1 = M.coef_[0]
xMax = np.max(X) + 0.2
xMin = np.min(X) - 0.2
plt.figure(figsize=(12, 10))
sns.set(style='darkgrid')
plt.scatter(X, Y, c='b')
plt.plot([xMin, xMax], [ϑ0 + ϑ1 * xMin, ϑ0 + ϑ1 * xMax], c='r')
plt.xlabel('rounded X values')
plt.ylabel('true X values + noise')
plt.title('Influence of rounding on explained variance')
plt.show()
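One way to understand the drop in the score: the rounding error on X is roughly uniform on $[-0.625, 0.625]$, so its variance is approximately $\Delta^2/12$ with grid width $\Delta = 1.25$. This is of the same order as the noise variance, which is why the explained variance drops noticeably. A rough back-of-envelope check (an approximation, not part of the fitted model):
In [ ]:
delta = 1 / 0.8                # width of the rounding grid
print(delta**2 / 12)           # approximate variance of the rounding error (≈ 0.13)
print(0.5**2)                  # variance of the noise added to Y (0.25)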
Next, we investigate the effect of subclassing. We will only keep those data points for which $X > 11$.
In [ ]:
X.shape
In [ ]:
selectorX = (X > 11)                     # boolean mask selecting the large X values
selectorY = np.reshape(selectorX, (N,))  # the same mask, flattened so it can index Y
XS = X[selectorX]                        # selected X values (this yields a one-dimensional array)
XS = np.reshape(XS, (len(XS), 1))        # reshape back into a matrix for scikit-learn
YS = Y[selectorY]                        # corresponding Y values
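As a quick check, we can compare how many data points survive the selection:
In [ ]:
print(X.shape, XS.shape)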
Again, we fit a linear model.
In [ ]:
model = lm.LinearRegression()
M = model.fit(XS, YS)
M.score(XS, YS)
We see that the performance of linear regression has degraded considerably. Let's plot this.
In [ ]:
ϑ0 = M.intercept_
ϑ1 = M.coef_[0]
xMax = np.max(XS) + 0.2
xMin = np.min(XS) - 0.2
plt.figure(figsize=(12, 10))
sns.set(style='darkgrid')
plt.scatter(XS, YS, c='b')
plt.plot([xMin, xMax], [ϑ0 + ϑ1 * xMin, ϑ0 + ϑ1 * xMax], c='r')
plt.xlabel('rounded X values')
plt.ylabel('true X values + noise')
plt.title('Influence of subclassing on explained variance')
plt.show()
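The drop in the score is explained by the restricted range of X: within the subset $X > 11$ the variance of X is much smaller, while the noise variance stays at $0.25$, so the fraction of explained variance $\mathrm{Var}(X)/(\mathrm{Var}(X) + 0.25)$ shrinks. A rough check of this back-of-envelope estimate (it ignores the additional error introduced by rounding):
In [ ]:
varXS = np.var(XS)                     # variance of the retained X values
print(varXS, varXS / (varXS + 0.25))   # rough estimate of the achievable R² on the subset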